

         Notes on Disc Structures and Allocation Strategies for NewCore


Author:    Mike Challis

History:

  0.01  27-Oct-93  Document started
  0.02  29-Oct-93  Version distributed for comment to WS, JRe, KW, JRo


1  Summary

This document contains notes on possible disc structures and allocation
strategies for a potential replacement for FileCore, referred to here as
"NewCore".

At present, these notes concentrate on the requirements of large hard discs;
floppy requirements may well need to be considered separately.



2  Features of FileCore

[ Notes: 

  This is not intended to be a precise description of the FileCore disc
  structures, but should be sufficient to understand the principles involved.

  Some statements are followed by a bracketted question mark - this indicates
  that I'm not certain that they are precisely true, or that it is the full
  story!  ]

Disc space is allocated in "allocation units" each of size "alloc_size"; this
is a power of 2, and is typically 1k for a 100M hard disc.

Details about allocations are recorded in a "map", which can be considered as
an array of bits; each bit corresponds to one allocation unit of the disc.

A contiguously allocated piece of disc is known as a "fragment".

A "disc object" is a sequence of one or more fragments on the disc. Each disc
object has an identifier which I shall call its "object_id" (the PRM rather
confusingly calls this the fragment identifier).

The bits in the map that correspond to an allocated fragment form a "map
entry" with the following format:

       -----------------------
      | object_id | padding |1|
       -----------------------

    where:
       - the object_id is that of the disc object of which this fragment is
         a part
       - the "padding" is an optional sequence of zero bits
       - the '1' bit terminates the map entry

This means that there is a minimum size for fragments determined by the
number of bits needed to store the object_id; this is called "min_frag_size",
and is 14k for my 100M hard disc (ie 13 bits needed to represent object_ids).

Files and directories - or, simply, "objects" - are held in disc objects.

A disc object is either "unshared" - in which case it contains just one
object - or is "shared" - in which case it may contain more than one object.

Each object has an "object address" which is composed of two parts:
   - the object_id of the disc object containing it
   - one plus the address of the sector at which it starts, if it is shared,
     or zero otherwise, known as the "sector_num"

The map is split into "map blocks", each of which is one sector in size. The
part of the disc which is mapped by a particular map block is known as a
"zone"; for example, my hard disc contains 25 zones.

There is a limit to the number of fragments that any zone can contain, and
this is known as the "ids_per_zone". This fact is used to assign object_ids
in such a way that the zone containing the first fragment of a disc object
can be calculated directly from its object_id (as, eg, object_id DIV
ids_per_zone).

The location of this fragment within the zone is found by scanning the map
block looking for the first map entry for a fragment with the correct
object_id.

The location of any further fragments of the disc object are located by
continuing the scan looking for further map entries with the same object_id.
The scan proceeds through the map, cycling round to the beginning again if
necessary. This means that the fragments of a disc object appear in the same
physical as logical order.


Characteristics of this disc structure are:

  - Fragments can be moved in various ways without altering the address of
    the disc object to which they belong:

      - The first fragment of a disc object can be moved within its own zone
        provided that it does not move beyond any further fragment of the
        same disc object.

      - Subsequent fragments can be moved - even across zones - provided that
        the physical order of the fragments that make up the disc object is
        not affected.

    In particular, it is possible to coalesce all of the fragments of a
    multi-fragment object without altering its address.

  - On the other hand, when an object in a shared disc object is moved, its
    object address changes.

  - The structures are optimised for sequential access: random access to
    large multi-fragment files may involve significant amounts of disc
    searching.

When an object address changes, the object's directory entry must be updated,
and so FileCore only permits limited forms of sharing:

  - The files immediately inside a directory can be shared with that
    directory

  - Files with the same parent directory can share (?)


Information on FileCore's allocation and compaction strategies is hard to
come by (probably a full understanding of the code is the only way forward),
but I believe the following to be true (?):

  - Directories never share.

  - A complex strategy for allocation ensures that "small compactions happen
    in response to particular allocation needs and opportunities rather than
    compacting a whole zone at once" [quote from Arthur2 design document].

  - A directory is always at the beginning of a disc object.

  - Shared disc objects are never multi-fragment.

  - A file may be reallocated when it is opened for update, and when it is
    closed.


3  FileCore limitations

The following list gives some of the reasons why FileCore is not able to
support large discs adequately.


3.1  Limited object_id range

The maximum length for an object_id is 15 bits - this is "built-in" to the
code.

Since min_frag_size >= (disc size)/2^(object_id_len), this generates large
min_frag_size values for large discs; for example, an 8G disc would have a
min_frag_size of at least 256k.

Such a large fragment size will result in wasted space when only a directory
and its immediate files can coexist in a single fragment.


3.2  Limited sector offset range

The sector offset occupies 8 bits in an object address, and so is limited to
512 sectors; this corresponds to a maximum min_frag_size of 256k, so there
would still be an addressing problem even if better utilisation could be made
of large shared fragments.


3.3  Map size

Ignoring object_id length limitations, we find there is a trade-off between
map_size and alloc_size/min_frag_size for a given disc size; for example for
an 8G disc we have, roughly:

                map_size        alloc_size      min_frag_size

                  290k              4k                72k
                  137k              8k               136k
                   64k             16k               272k

In other words, halving the map_size doubles the alloc_size and the
min_frag_size.

FileCore expects the entire map to be held in RAM, and so eventually a large
map size becomes unacceptable for this reason; there will also be performance
hits on those algorithms which search through the map.


3.4  No support for new filing system features

NewCore is expected to support both long filenames and directories with
arbitrary numbers of objects inside them. Both these features mean that
directories will need to be variable-length, and may also become quite
large.

At present, FileCore assumes directories are fixed length (and small).


4  Options

The preceding considerations suggest that some kind of two-level disc
structure is needed, with the following overall properties:

  A "level 1" where:
    -  Map and free space information for the whole disc can be held in RAM,
       thus making it possible to identify areas for allocation without
       accessing the disc;
    -  Large chunks of contiguous disc space can be allocated to large files.

  A "level 2" where:
    -  small files and directories share space within objects allocated by
       level 1

This is, indeed, the FileCore approach, but FileCore assumes that:
  - Moving level 1 objects around the disc is a reasonable activity (ie for
    compaction);
  - Level 1 fragments can be made small enough so that little space is
    lost if they cannot be shared.


Two ways forward suggest themselves:

  a)  Retain a small min_frag_size, but introduce new techniques to handle
      the large maps that result.

  b)  Retain a small map_size, but introduce new techniques for improving
      sharing within the large fragments that arise.

These are discussed further in sections 5 and 6 below.

A further ingenious suggestion due to Jonathan Roach is to use the *same*
techniques (and code) at both levels: that is, use a top-level FileCore to
allocate "sub-discs" (which may or may not be contiguous), which are then
recursively processed by a "second-level" copy of the same code. My concerns
with this approach relate to allocation strategies - there are too many ways
to do things! Imagine wishing to allocate space for a new "small" file when
there is insufficient space in the sub-disc containing its directory - do you
increase the size of the sub-disc? create a new sub-disc? does it depend on
whether the sub-disc is contiguous or not?

I favour an approach which distinguishes clearly between the two levels, thus
being able to provide optimal facilities and strategies appropriate to each
level.


5  Paging the map

To provide a concrete example, suppose that we wish to maintain a minimum
fragment size of 16k for an 8G disc. We can do this with the following
FileCore parameter values:

      min_frag_size = 16k
      object_id length = 19 bits
      alloc_size = 512 bytes
      map_size = 2M
      number of zones = 4k

However, the map is now too big to hold in RAM, and so will have to be paged
in when required.

In practice, most allocation operations will result in only localised access
to map blocks, but some operations could become unacceptably protracted; for
example, consider looking for the next fragment of a multi-fragment disc
object when a large contiguous file lies in the way:

      <----frag1---->    <----very large file ... ---->    <--frag2-->

The search would cause every map block with entries for the very large file
to be fetched - unnecessarily - from disc; the same would be true when
looking for some free space to allocate a new file.


However, both these problems can be resolved by holding more information at
the top level as follows:

   - Maintain a table containing the total free space in each zone - say
     16 bits for each of 4k entries gives 8k bytes.

     This makes it possible to locate a zone - and hence just one map block -
     with free space in it without reading any other map blocks.

   - Add an extra field to each map entry to hold the number of the zone in
     which the next fragment of a multi-fragment object can be found.

     This field does not affect the size of the map, but increases the size
     of the minimum fragment. In this case, 12 bits are needed to address
     a zone, and so min_frag_size increases by 12*512 = 6k.

     However, we can restore this to a value less than 16k by choosing a
     new set of parameters:

       min_frag_size = 9k
       object_id length = 20 bits
       alloc_size = 256 bytes
       map_size = 4M
       number of zones = 8k

With these changes it's possible to find space to allocate without accessing
the map, and it's possible to follow the chain of fragments that constitute
a multi-fragment disc object directly. On the other hand, movement of any
fragment of a disc object is now restricted to the zone in which it appears.


There are many unknowns with respect to this approach:

  - Will the FileCore allocation strategies work satisfactorily with very
    large maps?

  - Will a relatively small cache of map blocks be sufficient to ensure
    adequate performance?

  - Could the necessary changes be made to FileCore, or would a rewrite be
    necessary anyway?

In some ways this is the simplest proposal - ie the smallest change to the
status-quo - but the one that is most difficult to validate as workable.


6  Better sharing

FileCore object addresses are indirected only at the top level, by use of the
object_id; if an object is shared, its position within the top level disc
object is given as a direct sector offset.

Suppose, instead, that we include a small table inside each shared disc
object which has an entry for each possible object that might be stored
inside it; for the sake of argument I show this at the head of the fragment:


       -------------------------------------------------------
      | 0|46| 0|23| ...    <--obj3--> .... <--obj1-->         |
       -------------------------------------------------------
        0                 23              46

  This diagram shows the first four entries of the table; objects with sub_id
  equal to 0 or 2 do not exist here, and those with sub_ids 1 and 3 start at
  offsets 46 and 23 respectively.

If we use the table entry offset in place of a direct sector offset, we can
move objects within a shared fragment without altering their object
addresses.

We also need to keep a free space map (one bit per sector) to locate free
space, and we are then able to share any mixture of objects that we wish to
within a single fragment.

On the other hand, we don't want to have to "open up" a shared fragment in
order to find out how much free space it contains, and so we add a new field
to the map entry to hold the total number of sectors of free space within
that fragment.


As a concrete example, consider the following parameters for an 8G disc:

      min_frag_size = 432k
      object_id length = 16 bits
      number of bits to hold sectors of free space = 10 bits
      alloc_size = 16k
      map_size = 64k
      number of zones = 128
      zone size = 64M

If we limit the size of a shared fragment to 512k, then each fragment can
hold a maximum of 1024 objects (each one sector in size), and so the table
needs to have just 1k entries, each one 10 bits long. The free space map is
just 1k bits; so less than 2k of the 512k fragment is taken up with this
extra indirection information.


This approach is attractive because the two levels are clearly delineated,
and allocation strategies appropriate to each can be implemented. For
example, we can characterise the differences between the two levels as
follows:

  level 1:
        - Large fragments, large unit of allocation
        - One disc object can be multi-fragment
        - One fragment can be shared (at level 2)
        - Fragments need not (and should not) usually be moved*
        - Free space and indirection information - the map - fits into RAM
        - Divided into zones

      * If level 1 fragments were *never* moved we could encounter a
        situation where a large file could not be extended because there
        was no free space between its last and first fragment.

  level 2:
        - Small sub-units, allocation unit is the sector
        - All objects are stored as contiguous
        - Sectors are not shared between objects
        - Objects will be moved to make space available
        - Free space and indirection information is held inside the level 1
           fragment and so has to be fetched from disc

On the minus side, there is still a wide variety of options to consider when
allocating space - many of the remarks made above about the "recursive"
approach also apply here - and designing a credible allocation strategy is
proving difficult.


7  Miscellaneous points


7.1  Distributed map blocks

FileCore stores all of its map blocks in the centre of the disc; this means
that any write to the map will cause the disc to seek back to the centre.

The "map paging" proposal would certainly benefit from a distributed map -
perhaps each map block should be held in the middle of the zone to which it
refers.

There is less to be gained from this approach with the "better sharing"
scheme, since the tables inside each shared fragment are, effectively, a
distributed map.

It's also worth noting that the free space information "slaved" inside a
special table (map paging) [or inside the map entries (better sharing)] need
not be written back to disc every time it changes, since it can easily be
reconstructed if necessary when it is found to be out-of-date.


7.2  No free space chain

In a FileCore map block, any free space is represented by map entries that
are chained together: the distance to the next free space map entry is
stored where the object_id would normally have been held. This has two
unfortunate side effects:

  - There is a minimum size for a free space fragment, which can sometimes
    lead to long term "lost" free space:

      Suppose two fragments abut each other, and that removal of object(s)
      from the first means that some free space becomes available at the end
      of it; this can only be represented as a free space fragment if it is
      large enough.

      Suppose it is not, but that later on the second fragment is deleted
      thus making a free space fragment immediately after the first. There
      is no way for FileCore to determine that this can be coalesced with
      the end of the first fragment, and so that free space is "lost" until
      a further change in the first fragment's allocation occurs.

  - The search for a map entry representing an allocated fragment has to go
    hand-in-hand with a trek along the free space list, since otherwise a
    free space fragment whose "next" offset just happens to be the same as
    the required object_id might be picked up instead.

One solution to this is to delimit map entries by a '1' bit at the start as
well as at the end; then all zero bits between map entries would represent
free space. There are two minor disadvantages to this:

  - Scanning the free space list takes a little bit longer.

  - min_frag_size is increased by alloc_size.


7.3  Physical characteristics

FileCore appears to make the following assumptions about a disc's behaviour:

  - Maximum transfer rate is attained when reading/writing a sequence of
    consecutive sectors [presumably the disc is formatted in such a way as
    to ensure this is true].

  - "Seek time" from one sector to another within the same zone is "fast".

  - Seeking between zones is "slow".

When a disc has many heads, (almost) maximum transfer rates can also be
obtained by accessing one sector from one track, and the next sector from
another track in the same cylinder:

     head 1:    <---01---> <---02---> <---03--->
     head 2:     <---12---> <---13---> <---14--->
     head 3:      <---23---> <---24---> <---25--->

  Reading sectors 01, 13, 25 is almost as fast as reading 01, 02, 03.

Perhaps NewCore should try harder to equate "zones" or "chunks" (see below)
with "cylinders" or "cylinder groups" so as to be able to take advantage of
this behaviour.



8  Fixed size chunks

The attractions of a scheme where level 1 fragments can be allocated to be
just the size that is required (above the minimum, that is) seemed to be
obvious until I tried to develop a coherent allocation strategy. Immediately
questions like the following arose:

  - How large should a shared fragment be allowed to grow before some object
    within it is moved to another fragment?

  - How do we deal with "small" holes of space between allocated fragments?
    Since level 1 fragments are now large, it is much less attractive to
    simply move them around to compact and release such space.

In the interests of simplicity, it seems worth revisiting a simpler level 1
structure, where all of the fragments are the same size; to avoid confusion
I shall call fragments with this property "chunks".


As a concrete example, consider the following parameters for an 8G disc:

       chunk size = 512k
       number of chunks = 16k
       number of bits needed to identify a chunk = 14 bits
       number of bits needed to identify an object within a chunk = 10 bits

An object address has the form:

       <---chunk_id---><---sub_id--->
           14 bits         10 bits

where the chunk_id identifies directly the chunk containing the object, and
the sub_id - if greater than zero - identifies an entry in the (shared)
chunk's indirection table, where the sector offset of the start of the object
is held; a sub-id of zero indicates that the obect is not shared.

This format reflects the assumption that top-level disc objects are unlikely
to move, whereas second-level objects will need to be mobile; this is,
intriguingly, the exact converse of FileCore's current format.

Top level map information would have one entry for each chunk giving:

  - is it allocated or not?
  - is it shared or not?
  - if shared, how much free space (in sectors) does it contain?
  - if not shared, what is the address of the next chunk for this object, if
     any?

Other information might also be needed by particular allocation strategies,
such as:

  - if shared, how many directories are stored within it?

Supposing we allow 8 bits for the directory count, this results in around 24
bits per entry - say 48k for the entire map information.

We could divide this map up into 96 zones (each map block just one sector
in size) and distribute it across the disc; each zone would be around 86M
in size.

Within each chunk, the second level mapping table has one entry for each
possible shared object giving its offset within the chunk - or zero if it
is not currently in use; a total of 1k * 10 bits, or certainly less than 2k.


Now that all chunks are the same size, care must be taken to make sure that
little space is wasted simply because the chunk size and object size do not
match. This means, for example, that it must be possible to share the final
chunk of a multi-chunk object with other objects; in such circumstances, this
final portion is identified by the zero'th entry in the chunk's indirection
table.


Fixed size chunks are attractive for a number of reasons:

  - Chunk size can be optimised to match the physical characteristics of the
    hard disc.

  - Top level map information is smaller.

  - There are less choices to consider when developing an allocation
    strategy.

On the other hand:

  - It's more difficult to coalesce the various chunks of a multi-chunk file,
    and this may prove to be a performance issue for large files even though
    the chunk size is large.



9  An allocation strategy

This section is a first attempt to describe a credible allocation strategy,
and assumes the "fixed size chunk" disc structure model.


9.1  Principles

"Small" and "medium" files will be held in shared chunks, and "large" files
in (chains of) unshared chunks. We try to keep these in separate areas of the
disc as follows:


  |-- large files -->      <- small files ->      <-- large files --|
   ^                               ^                               ^
  disc start                  disc centre                   disc end

We hope to be able to allocate most small files in the same shared chunk as
their parent directory, and to stand a fighting chance of doing this we want
to avoid storing large files there too - hence an explicit decision to store
them "far away".

Choosing the centre of the disc for small files and their directories, and
the two extremes for large files, should minimise seek times for locating
large files.


To take account of the potential future space requirements of a directory, we
introduce the concept of "spoken-for space"; this is space that has been
"pencilled-in" for the small files of a directory, even though it may not yet
actually have been used. For example, we may decide when allocating space for
a new directory that we wish to pencil in 30k for future expansion.

The allocation strategy needs access to both the "free space" and the
"unspoken-for space"* inside each shared chunk - so both these values are
held in the top-level map information.

  [ * This is in place of the "number of directory entries" suggested in the
      section above; it does not make a significant difference to the size of
      the map information. ]

The idea is that the spoken-for space is reserved for small files that have a
right to be in the chunk (ie whose parent directory is there), but is not
used for new directories, or for medium files, or for small files that have
"overflowed" from some other chunk.


9.2  General points

In the following descriptions, a "nearest chunk" can be one with lower or
higher chunk_id.

Space allocated from a shared chunk is either spoken-for space or not. In the
former case, both the unspoken-for and free space values are updated; in the
latter case, just the free space value is updated. (Similar remarks apply
when space is released by file deletion, not covered here).

Files are classified into three sizes:

  - "small" files: which are expected to share chunk space with their parent
     directories, and for which spoken-for space is reserved;

  - "medium" files: which are still less than one chunk in size, but which
     are not expected to share chunk space with their parents;

  - "large" files: which are larger than one chunk in size.

The following paragraphs describe allocation strategies by listing the order
in which chunks should be considered; the first one found that meets the
stated criteria is used. The space allocated will normally be taken from
the unspoken-for space: cases where this is not so are noted.


9.3  Allocating space for a new small file

 1: The shared chunk C containing the file's parent directory, providing that
    it has sufficient free space; this allocates spoken-for space.

 2: The shared chunk nearest to C which has sufficient unspoken-for space.

 3: The chunk nearest to C which is not yet allocated.

 4: The shared chunk nearest to C which has sufficient free space (ie now
     ignoring spoken-for space requirements).

 5: Give up.


9.4  Allocating space for a new medium file

 1: The shared chunk C containing the file's parent directory, providing that
    it has sufficient unspoken-for space.

 2: The shared chunk nearest to C which has sufficient unspoken-for space.

 3: The chunk nearest to C which is not yet allocated.

 4: The chunk C, provided it has sufficient free space.

 5: The chunk nearest to C which has sufficient free space.

 6: Give up.


9.5  Allocating space for a new directory

Determine the initial size for the directory itself. This amount, called the
directory's "size", might be determined as follows:

  - calling program specifies the amount (eg *copy), or:
  - estimated from calling program's requested number of directory entries:
        eg num_entries * 40 bytes, or:
  - 10 entries * 40 bytes = 400 bytes, say one sector.

Determine the amount of space that should be pencilled-in for this directory
and its files. This amount, called the directory's "requirement", might be
determined as follows:

  - calling program specifies the amount (eg *copy), or:
  - estimated from calling program's requested number of directory entries:
        eg num_entries * 8k, or:
  - 10 entries * 8k = 80k

 1: The shared chunk C containing the directory's parent directory, providing
    that it has sufficient unspoken-for space for the requirement, and enough
    free space for the size.

 2: The shared chunk nearest to C which has sufficient unspoken-for space for
    the requirement and enough free space for the size.

 3: The shared chunk nearest to C which is unallocated.

 4: The shared chunk C, providing it has sufficient free space for the size;
    if allocated, set the unspoken-for space value to zero.

 5: The shared chunk nearest to C which has sufficient free space for the
    size; if allocated, set the unspoken-for space value to zero.

 6: Give up.


9.6  Allocating space for a large file

For each chunk of the file except the last:

 1: The unallocated chunk nearest to the start/end of the disc.

For the last chunk of the file:

 2: The unallocated chunk nearest to the start/end of the disc. Mark this
    chunk as shared.

If no unallocated chunks remain:

 3: Give up.


9.7  Comments

The strategy above makes no attempt to rationalise space usage between
chunks, and so once space usage becomes "disorganised" - perhaps as a result
of the disc becoming full - it will not automatically re-organise itself.
Only simulation can show whether this is likely to be a real problem.

If it is, one solution may be to limit the areas of disorganisation by
limiting the searches defined in the strategies above to individual zones
of the disc. Other changes would also need to be made, in order to ensure
that directories and files were distributed sensibly amongst the zones.
